Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. My goal here is to predict the “classe” variable with the help of the other predictors.
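Throughout the analysis I use a handful of packages. The original setup chunk was not shown, so the following is my reconstruction, inferred from the functions called later; treat it as an assumption rather than the exact original:
library(caret)              # createDataPartition, preProcess, train, confusionMatrix
library(dplyr)              # select, %>%
library(corrplot)           # corrplot
library(correlationfunnel)  # binarize, correlate, plot_correlation_funnel
library(GGally)             # ggpairs
library(e1071)              # svm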
Let’s load the data. I have already downloaded the files to my local system.
dfTrain <- read.csv("pml-training.csv", stringsAsFactors = FALSE,
                    na.strings = c("", "NA", "#DIV/0!"))
dfTest <- read.csv("pml-testing.csv", stringsAsFactors = FALSE,
                   na.strings = c("", "NA", "#DIV/0!"))
dim(dfTrain); dim(dfTest)
## [1] 19622 160
## [1] 20 160
Let’s create a validation set for model tuning:
# for reproducibility
set.seed(101)
inTrain <- createDataPartition(dfTrain$classe, p = 0.8, list = F)
dfVal <- dfTrain[-inTrain,]
dfTrain <- dfTrain[inTrain,]
dim(dfTrain); dim(dfVal)
## [1] 15699 160
## [1] 3923 160
Now all three partitions of our data are ready. Let’s dive into the analysis, but first let’s look at the proportion of each “classe”:
table(dfTrain$classe)/nrow(dfTrain)
##
## A B C D E
## 0.2843493 0.1935155 0.1744060 0.1638958 0.1838334
From the above it is clear that the data is not badly imbalanced across the different “classe” values.
The data has 160 columns, and the training partition has 15699 rows. The data was collected with the help of 4 sensors: belt, arm, forearm, and dumbbell.
Let’s take a quick look at the missingness of the data. As the number of features is large, it is easier to inspect them sensor by sensor:
For Belt sensor:
belt_miss <- sapply(select(dfTrain, names(dfTrain)[grepl("_belt", names(dfTrain))]),
                    function(x) sum(is.na(x)))
belt_miss
## roll_belt pitch_belt yaw_belt
## 0 0 0
## total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 0 15396 15413
## kurtosis_yaw_belt skewness_roll_belt skewness_roll_belt.1
## 15699 15395 15413
## skewness_yaw_belt max_roll_belt max_picth_belt
## 15699 15388 15388
## max_yaw_belt min_roll_belt min_pitch_belt
## 15396 15388 15388
## min_yaw_belt amplitude_roll_belt amplitude_pitch_belt
## 15396 15388 15388
## amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 15396 15388 15388
## stddev_roll_belt var_roll_belt avg_pitch_belt
## 15388 15388 15388
## stddev_pitch_belt var_pitch_belt avg_yaw_belt
## 15388 15388 15388
## stddev_yaw_belt var_yaw_belt gyros_belt_x
## 15388 15388 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z
## 0 0
For Arm sensor:
arm_miss <- sapply(select(dfTrain, names(dfTrain)[grepl("_arm", names(dfTrain))]),
                   function(x) sum(is.na(x)))
arm_miss
## roll_arm pitch_arm yaw_arm total_accel_arm
## 0 0 0 0
## var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## 15388 15388 15388 15388
## avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 15388 15388 15388 15388
## stddev_yaw_arm var_yaw_arm gyros_arm_x gyros_arm_y
## 15388 15388 0 0
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 0 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 0 0 0 15446
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 15448 15398 15445 15448
## skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm
## 15398 15388 15388 15388
## min_roll_arm min_pitch_arm min_yaw_arm amplitude_roll_arm
## 15388 15388 15388 15388
## amplitude_pitch_arm amplitude_yaw_arm
## 15388 15388
For Forearm sensor:
forearm_miss <- sapply(select(dfTrain,
                              names(dfTrain)[grepl("_forearm", names(dfTrain))]),
                       function(x) sum(is.na(x)))
forearm_miss
## roll_forearm pitch_forearm yaw_forearm
## 0 0 0
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## 15448 15449 15699
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## 15447 15449 15699
## max_roll_forearm max_picth_forearm max_yaw_forearm
## 15388 15388 15448
## min_roll_forearm min_pitch_forearm min_yaw_forearm
## 15388 15388 15448
## amplitude_roll_forearm amplitude_pitch_forearm amplitude_yaw_forearm
## 15388 15388 15448
## total_accel_forearm var_accel_forearm avg_roll_forearm
## 0 15388 15388
## stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## 15388 15388 15388
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 15388 15388 15388
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x
## 15388 15388 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z
## 0 0
For Dumbbell sensor:
dumbbell_miss <- sapply(select(dfTrain,
                               names(dfTrain)[grepl("_dumbbell", names(dfTrain))]),
                        function(x) sum(is.na(x)))
dumbbell_miss
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 15392 15390 15699
## skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 15391 15389 15699
## max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell
## 15388 15388 15392
## min_roll_dumbbell min_pitch_dumbbell min_yaw_dumbbell
## 15388 15388 15392
## amplitude_roll_dumbbell amplitude_pitch_dumbbell amplitude_yaw_dumbbell
## 15388 15388 15392
## total_accel_dumbbell var_accel_dumbbell avg_roll_dumbbell
## 0 15388 15388
## stddev_roll_dumbbell var_roll_dumbbell avg_pitch_dumbbell
## 15388 15388 15388
## stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## 15388 15388 15388
## stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x
## 15388 15388 0
## gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x
## 0 0 0
## accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## 0 0 0
## magnet_dumbbell_y magnet_dumbbell_z
## 0 0
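The same per-sensor summaries can also be produced in one pass. This is an equivalent sketch (using the same grepl patterns as above), not the code that generated the printed output:
# Count NAs per column for each of the four sensors in one loop
sensor_patterns <- c(belt = "_belt", arm = "_arm",
                     forearm = "_forearm", dumbbell = "_dumbbell")
miss_by_sensor <- lapply(sensor_patterns, function(p) {
  cols <- names(dfTrain)[grepl(p, names(dfTrain))]
  colSums(is.na(dfTrain[, cols]))
})
# Proportion of each sensor's columns that are almost entirely missing
sapply(miss_by_sensor, function(m) mean(m > 0.9 * nrow(dfTrain)))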
It is very interesting to see that many of the features are over 90% missing; I will drop those columns for the rest of the analysis. Also notable is that most of these columns share almost exactly the same NA count (15388), which suggests they are summary statistics that are only filled in on a small subset of rows rather than values missing at random.
column_2drop <- c(names(belt_miss[belt_miss != 0]),
                  names(arm_miss[arm_miss != 0]),
                  names(forearm_miss[forearm_miss != 0]),
                  names(dumbbell_miss[dumbbell_miss != 0]))
length(column_2drop)
## [1] 100
So we can drop those 100 columns, as they are mostly missing. After dropping them (together with the 7 identifier and timestamp columns removed below), 52 predictors will remain.
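Equivalently, since every affected column is mostly empty, the drop list can be derived in one line. A sketch of this alternative, not the code used below:
# Flag any column with more than 90% missing values in the training set
column_2drop_alt <- names(dfTrain)[colMeans(is.na(dfTrain)) > 0.9]
length(column_2drop_alt)  # expected to match the 100 found above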
Now let’s get into the analysis. First, let’s look at the correlation among the predictors.
# dropping the columns
dfAnalize <- as_tibble(dfTrain %>%
                         select(-all_of(column_2drop),
                                -c(X, user_name, raw_timestamp_part_1,
                                   raw_timestamp_part_2, cvtd_timestamp,
                                   new_window, num_window)))
dfAnalize$classe <- as.factor(dfAnalize$classe)
dfAnalize[,1:52] <- lapply(dfAnalize[,1:52],as.numeric)
dim(dfAnalize)
## [1] 15699 53
corr_col <- cor(select(dfAnalize, -classe))
diag(corr_col) <- 0
corr_col <- which(abs(corr_col)>0.8,arr.ind = T)
corr_col <- unique(row.names(corr_col))
corrplot(cor(select(dfAnalize, all_of(corr_col))),
         type = "upper", order = "hclust", method = "number")
Here I have subsetted the data to show only the columns whose absolute correlation with at least one other column is higher than 0.8. From the correlation plot it is clear that many columns are highly correlated, which might be an issue in the modeling phase. We could either drop those columns or perform PCA (Principal Component Analysis). One important thing to note from this graph is that high correlation only occurs between features from the same sensor, i.e. belt, arm, forearm, and dumbbell.
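As an aside, if we took the PCA route instead of dropping columns, caret’s preProcess supports it directly. A minimal sketch, not used in the models below:
# PCA alternative: keep enough components to explain 95% of the variance
pca_obj <- preProcess(select(dfAnalize, -classe),
                      method = c("center", "scale", "pca"), thresh = 0.95)
dfPCA <- predict(pca_obj, select(dfAnalize, -classe))
dim(dfPCA)  # fewer, mutually uncorrelated columns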
As the target is a categorical variable, we cannot check its correlation with the other variables directly. But we can use correlationfunnel::correlate to see the correlation between each level of “classe” and the (binarized) features. Let’s go through the levels one by one.
# binarizing data
#correlationfunnel website: https://business-science.github.io/correlationfunnel/
corr_funl_df <- dfAnalize %>% binarize(n_bins = 4, thresh_infreq = 0.01)
corr_a <- corr_funl_df %>% correlate(target = classe__A)
corr_a %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
For *classe__A* it seems that the “Arm and Forearm” sensors are more important.
corr_b <- corr_funl_df %>% correlate(target = classe__B)
corr_b %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__B* it seems that the “Dumbbell and Belt” sensors are more important.
corr_c <- corr_funl_df %>% correlate(target = classe__C)
corr_c %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__C* it seems that the “Dumbbell” sensor is more important.
corr_d <- corr_funl_df %>% correlate(target = classe__D)
corr_d %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__D* it seems that the “Forearm, Arm and Dumbbell” sensors are more important.
corr_e <- corr_funl_df %>% correlate(target = classe__E)
corr_e %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__E* it seems that the “Belt” sensor is more important.
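Rather than reading the top features off the interactive plots by hand, they could also be pulled programmatically from the correlate output. A sketch, assuming correlate returns its usual feature/bin/correlation tibble:
# Top 5 features per class, ranked by absolute correlation
top5_features <- function(corr_tbl) {
  corr_tbl %>%
    filter(feature != "classe") %>%   # drop the target's own levels
    arrange(desc(abs(correlation))) %>%
    distinct(feature) %>%
    head(5) %>%
    pull(feature) %>%
    as.character()
}
top5_features(corr_a)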
This document is already quite long for a Coursera assignment, so for this section I’ll work with the top 5 features for each class, as identified in the last section. Let’s select only those columns.
# subsetting dfAnalize
col_a <- c("magnet_arm_x", "pitch_forearm", "magnet_dumbbell_y",
           "roll_forearm", "gyros_dumbbell_y")
col_b <- c("magnet_dumbbell_y", "magnet_dumbbell_x", "roll_dumbbell",
           "magnet_belt_y", "accel_dumbbell_x")
col_c <- c("magnet_dumbbell_y", "roll_dumbbell", "accel_dumbbell_y",
           "magnet_dumbbell_x", "magnet_dumbbell_z")
col_d <- c("pitch_forearm", "magnet_arm_y", "magnet_forearm_x",
           "accel_dumbbell_y", "accel_forearm_x")
col_e <- c("magnet_belt_y", "magnet_belt_z", "roll_belt",
           "gyros_belt_z", "magnet_dumbbell_y")
final_cols <- unique(c(col_a, col_b, col_c, col_d, col_e))
dfAnalize2 <- dfAnalize %>% select(all_of(final_cols), classe)
data.frame("arm" = sum(grepl("_arm",final_cols)),
"forearm" = sum(grepl("_forearm",final_cols)),
"belt" = sum(grepl("_belt",final_cols)),
"dumbbell" = sum(grepl("_dumbbell",final_cols)))
## arm forearm belt dumbbell
## 1 2 4 4 7
One interesting thing to note here is that the dumbbell turned out to be the most important of the 4 sensors. I would like to explore that in future work.
my_dens <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_density(..., alpha = 0.3)+scale_fill_brewer(palette="Set2")
}
my_point <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_point(..., alpha = 0.1)+ scale_fill_brewer(palette="Set2")
}
ggpairs(dfAnalize2, columns = 1:5,aes(color = classe),
lower = list(continuous = my_point),diag = list(continuous = my_dens))
ggpairs(dfAnalize2, columns = 6:10,aes(color = classe),
lower = list(continuous = my_point),diag = list(continuous = my_dens))
ggpairs(dfAnalize2, columns = 11:17,aes(color = classe),
lower = list(continuous = my_point),diag = list(continuous = my_dens))
So we can see that most of the features are quite skewed; as preprocessing steps we will therefore “center” and “scale” the features and apply a “BoxCox” transform.
Some of the features have very interesting scatter plots, but as I am under a time constraint for this project, I’ll revisit them later.
In the above section we narrowed the data down to 17 predictors and settled on 3 preprocessing steps. In this section we build on that analysis to create models for prediction: a classification tree, a random forest, and an SVM, which we will then compare on the validation set.
dfTrainF <- dfTrain %>% select(all_of(final_cols), classe)
dfValF <- dfVal %>% select(all_of(final_cols), classe)
dfTrainF[,1:17] <- sapply(dfTrainF[,1:17], as.numeric)
dfValF[,1:17] <- sapply(dfValF[,1:17], as.numeric)
levels <- c("A", "B", "C", "D", "E")
preprop_obj <- preProcess(dfTrainF[,-18],method = c("center","scale","BoxCox"))
xTrain <- predict(preprop_obj,select(dfTrainF,-classe))
yTrain <- factor(dfTrainF$classe,levels=levels)
xVal <- predict(preprop_obj,select(dfValF,-classe))
yVal <- factor(dfValF$classe,levels=levels)
trControl <- trainControl(method="cv", number=5)
#CFtree
modelCT <- train(x = xTrain,y = yTrain,
method = "rpart", trControl = trControl)
#RF
modelRF <- train(x = xTrain,y = yTrain,
method = "rf", trControl = trControl,verbose=FALSE, metric = "Accuracy")
#SVM
modelSVM <- svm(x = xTrain,y = yTrain,
kernel = "polynomial", cost = 10)
Let’s look at the results:
confusionMatrix(predict(modelCT,xVal),yVal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1003 330 319 294 106
## B 19 256 20 109 103
## C 93 173 345 240 212
## D 0 0 0 0 0
## E 1 0 0 0 300
##
## Overall Statistics
##
## Accuracy : 0.4853
## 95% CI : (0.4696, 0.5011)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3271
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8987 0.33729 0.50439 0.0000 0.41609
## Specificity 0.6263 0.92067 0.77833 1.0000 0.99969
## Pos Pred Value 0.4888 0.50493 0.32455 NaN 0.99668
## Neg Pred Value 0.9396 0.85275 0.88147 0.8361 0.88377
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2557 0.06526 0.08794 0.0000 0.07647
## Detection Prevalence 0.5231 0.12924 0.27097 0.0000 0.07673
## Balanced Accuracy 0.7625 0.62898 0.64136 0.5000 0.70789
Clearly the classification tree is not performing well; accuracy is only about 0.49. One thing to note is that true classe A cases are detected with high sensitivity, but many cases from the other classes are incorrectly predicted as classe A, and classe D is never predicted at all.
confusionMatrix(predict(modelRF,xVal),yVal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 7 0 0 0
## B 3 741 5 3 1
## C 1 7 676 15 4
## D 0 4 3 625 1
## E 0 0 0 0 715
##
## Overall Statistics
##
## Accuracy : 0.9862
## 95% CI : (0.9821, 0.9896)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9826
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9763 0.9883 0.9720 0.9917
## Specificity 0.9975 0.9962 0.9917 0.9976 1.0000
## Pos Pred Value 0.9937 0.9841 0.9616 0.9874 1.0000
## Neg Pred Value 0.9986 0.9943 0.9975 0.9945 0.9981
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1889 0.1723 0.1593 0.1823
## Detection Prevalence 0.2852 0.1919 0.1792 0.1614 0.1823
## Balanced Accuracy 0.9970 0.9862 0.9900 0.9848 0.9958
plot(modelRF$finalModel, main = "Error vs. number of trees")
Random Forest took the lead with 98%+ accuracy.
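To see which of the 17 predictors drive this performance, caret’s varImp can be applied to the fitted model. A quick sketch (output not shown):
# Scaled variable importance from the fitted random forest
plot(varImp(modelRF), top = 10)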
confusionMatrix(predict(modelSVM,xVal),yVal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1096 40 18 17 2
## B 1 676 15 5 6
## C 9 40 640 45 3
## D 10 3 9 575 9
## E 0 0 2 1 701
##
## Overall Statistics
##
## Accuracy : 0.9401
## 95% CI : (0.9322, 0.9473)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9241
##
## Mcnemar's Test P-Value : 1.808e-15
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.8906 0.9357 0.8942 0.9723
## Specificity 0.9726 0.9915 0.9701 0.9905 0.9991
## Pos Pred Value 0.9344 0.9616 0.8684 0.9488 0.9957
## Neg Pred Value 0.9927 0.9742 0.9862 0.9795 0.9938
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2794 0.1723 0.1631 0.1466 0.1787
## Detection Prevalence 0.2990 0.1792 0.1879 0.1545 0.1795
## Balanced Accuracy 0.9773 0.9411 0.9529 0.9424 0.9857
So it is clear that the SVM gives good accuracy, though not as good as the random forest.
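The cost = 10 above was picked ad hoc; with more time, e1071’s tune.svm could grid-search it. A hypothetical sketch, not run for this report:
# Grid search over the cost parameter for the polynomial-kernel SVM
svm_tune <- tune.svm(x = xTrain, y = yTrain, kernel = "polynomial",
                     cost = c(1, 10, 100))
svm_tune$best.parameters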
From the above analysis it is clear that Random Forest takes the lead in terms of prediction. Now let’s see how well it does on the test set given in the Coursera project.
dfTest2 <- dfTest %>% select(all_of(final_cols), problem_id)
xTest <- dfTest2 %>% select(all_of(final_cols))
result <- data.frame("problem_id" = dfTest$problem_id,
"PREDICTION_RF" = predict(modelRF,xTest),
"PREDICTION_SVM" = predict(modelSVM,xTest))
result
## problem_id PREDICTION_RF PREDICTION_SVM
## 1 1 E C
## 2 2 A A
## 3 3 A B
## 4 4 E C
## 5 5 E A
## 6 6 E C
## 7 7 E B
## 8 8 B A
## 9 9 A E
## 10 10 E E
## 11 11 A C
## 12 12 A C
## 13 13 E E
## 14 14 A B
## 15 15 E B
## 16 16 E A
## 17 17 E C
## 18 18 B A
## 19 19 E A
## 20 20 E E
My validation accuracy was good, but my score on the quiz was not. Maybe I should not have dropped so many columns when I narrowed down to 17 features. In hindsight there is also a more concrete suspect: modelRF and modelSVM were trained on centered/scaled/BoxCox-transformed inputs (via preprop_obj), but xTest above was never passed through that preprocessing, so the models saw test features on a different scale than they were trained on. Since Random Forest is doing so well, I’ll train one more RF on all the predictors and see if it helps.
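A sketch of that fix (not re-run for this report): pass the test features through the same preprocessing object before predicting.
# modelRF and modelSVM were fit on center/scale/BoxCox-transformed inputs,
# so the test set must be transformed the same way
xTestRaw <- dfTest %>% select(all_of(final_cols))
xTestRaw[,1:17] <- sapply(xTestRaw[,1:17], as.numeric)
xTestPP <- predict(preprop_obj, xTestRaw)
predict(modelRF, xTestPP)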
So for this update I will use all the predictors. And as there are so many columns, I’ll train in parallel so it does not take as long.
dfTrainF2 <- as_tibble(dfTrain %>%
                         select(-all_of(column_2drop),
                                -c(X, user_name, raw_timestamp_part_1,
                                   raw_timestamp_part_2, cvtd_timestamp,
                                   new_window, num_window)))
xTrain2 <- dfTrainF2 %>% select(-classe)
xTrain2 <- sapply(xTrain2,as.numeric)
yTrain2 <- factor(dfTrainF2$classe,levels=levels)
dfValF2 <- as_tibble(dfVal %>%
                       select(-all_of(column_2drop),
                              -c(X, user_name, raw_timestamp_part_1,
                                 raw_timestamp_part_2, cvtd_timestamp,
                                 new_window, num_window)))
xVal2 <- dfValF2 %>% select(-classe)
xVal2 <- sapply(xVal2,as.numeric)
yVal2 <- factor(dfValF2$classe,levels=levels)
dfTestF2 <- as_tibble(dfTest %>%
                        select(-all_of(column_2drop),
                               -c(X, user_name, raw_timestamp_part_1,
                                  raw_timestamp_part_2, cvtd_timestamp,
                                  new_window, num_window)))
xTest2 <- dfTestF2 %>% select(-problem_id)
xTest2 <- sapply(xTest2,as.numeric)
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.1.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.1.3
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.1.3
## Loading required package: parallel
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
## [1] 3
modelRF2 <- train(x = xTrain2,y = yTrain2, method = "rf",
metric = "Accuracy",
trControl=trainControl(method = "cv", number = 4,
p= 0.60, allowParallel = TRUE ))
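Once training finishes, the workers should be released; this step was omitted from the original run:
# Release the parallel workers and fall back to sequential execution
stopCluster(cl)
registerDoSEQ()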
#Check the result
result2 <- data.frame("problem_id" = dfTest$problem_id,
"PREDICTION_RF" = predict(modelRF,xTest),
"PREDICTION_SVM" = predict(modelSVM,xTest),
"PREDICTION_RF2_ALL_COL"=predict(modelRF2,xTest2))
result2
## problem_id PREDICTION_RF PREDICTION_SVM PREDICTION_RF2_ALL_COL
## 1 1 E C B
## 2 2 A A A
## 3 3 A B B
## 4 4 E C A
## 5 5 E A A
## 6 6 E C E
## 7 7 E B D
## 8 8 B A B
## 9 9 A E A
## 10 10 E E A
## 11 11 A C B
## 12 12 A C C
## 13 13 E E B
## 14 14 A B A
## 15 15 E B E
## 16 16 E A E
## 17 17 E C A
## 18 18 B A B
## 19 19 E A B
## 20 20 E E B
This time I did better on the quiz. With only 20 test cases, every error weighs heavily; combined with the unpreprocessed test features noted above, that likely explains why the earlier models looked so bad despite decent validation accuracy.
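One sanity check not shown above: evaluating modelRF2 on the held-out validation set before trusting it on the quiz. A sketch:
# Validation-set performance of the all-predictor random forest
confusionMatrix(predict(modelRF2, xVal2), yVal2)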